(54) Apparatus and method for loop execution

(57) A processing engine, comprising an execution
unit for executing machine-readable instructions
dispatched thereto, and an instruction buffer for
temporarily storing a plurality of machine-readable
instructions transferred thereto prior to dispatch to the
execution unit. The execution unit is responsive to a first
machine-readable instruction to initiate iterative
execution of a block of machine-readable instructions,
including an initial and a final instruction for said block,
all locatable in said instruction buffer. During iterative
execution, instruction fetch into the instruction buffer is
inhibited.

Description

FIELD OF INVENTION 

[0001] The present invention relates to processing 
engines, and in particular but not exclusively, to 
processing engines configurable to repeat program 
flow. 

BACKGROUND OF INVENTION 

[0002] It is known to provide for parallel execution of
instructions in microprocessors using multiple instruc- 
tion execution units. Many different architectures are 
known to provide for such parallel execution. Providing 
parallel execution increases the overall processing 
speed. Typically, multiple instructions are provided in
parallel in an instruction buffer and these are then 
decoded in parallel and are dispatched to the execution 
units. Microprocessors are general purpose processing 
engines which require high instruction throughputs in 
order to execute software running thereon, which can 
have a wide range of processing requirements depending
on the particular software applications involved.
Moreover, in order to support parallelism, complex oper- 
ating systems have been necessary to control the 
scheduling of the instructions for parallel execution. 
[0003] Many different types of processing engine 
are known, of which microprocessors are but one example.
For example, Digital Signal Processors (DSPs) are
widely used, in particular for specific applications. DSPs
are typically configured to optimise the performance of 
the applications concerned and to achieve this they 
employ more specialised execution units and instruction 
sets. They have not, however, been provided with the 
parallel instruction execution architectures found in 
microprocessors and do not operate under the operat- 
ing systems used to control such microprocessors. 
[0004] In a DSP or microprocessor, machine-reada- 
ble instructions stored in a program memory are 
sequentially executed by the processor in order for the 
processor to perform operations or functions. The 
sequence of machine-readable instructions is termed a 
"program". Although the program instructions are typi- 
cally performed sequentially, certain instructions permit 
the program sequence to be broken, and for the program
flow to repeat a block of instructions. Such repetition
of a block of instructions is known as "looping", and
the block of instructions is known as a "loop".
[0005] When performing a loop, memory access, to 
program memory for example, has to be performed in 
order to fetch the instructions to be repeated. Typically, 
the memory, such as the program memory, resides off-chip,
and the memory access represents a considerable
processor cycle overhead. This militates against power
saving and fast processing at low power consumption,
in particular for applications where program loops are
likely to be utilised frequently.



[0006] The present invention is directed to improving
the performance of processing engines such as, for
example but not exclusively, digital signal processors.

SUMMARY OF INVENTION

[0007] Particular and preferred aspects of the
invention are set out in the accompanying independent
claims. Combinations of features from the separate
independent claims and the dependent claims may be
combined as appropriate and not merely as explicitly
set out in the claims.

[0008] In accordance with the first aspect of the
invention there is provided a processing engine
comprising an execution unit for executing
machine-readable instructions dispatched to the execution
unit. The processing engine also comprises an instruction
buffer for temporarily storing a plurality of
machine-readable instructions which have been transferred
to the instruction buffer prior to their dispatch to the
execution unit. The execution unit responds to a first
machine-readable instruction to initiate an iterative
execution of a block of machine-readable instructions
locatable in the instruction buffer. The block of
machine-readable instructions includes both an initial and
a final instruction.

[0009] In accordance with a second aspect of the
invention there is provided a method for operating a
processing engine comprising temporarily storing a
plurality of instructions in an instruction buffer. The
method further comprises repeatedly executing a block of
instructions of said plurality of instructions.
[0010] An advantage of a preferred embodiment in
accordance with the first and second aspects of the
invention is that a block of instructions may be
repeatedly executed without having to access program
memory or cache memory, since all the instructions are
already loaded in the instruction buffer. Accordingly,
there is a reduced number of processor cycles necessary
for execution of the block of instructions and
consequently a reduction in power consumption, compared to
conventional systems in which instructions comprising a
repeated block are fetched from program memory each
time they are to be executed.

[0011] Preferably the first instruction comprises an
offset which indicates the relative positions of a final
instruction and an initial instruction of the block of
instructions. This advantageously indicates to the
processing engine the boundaries of the block of
instructions to be repeated. Additionally, a check may be
carried out on the offset to ensure that the final
instruction will be located within the instruction buffer.
Suitably, this check may be carried out during compilation
or assembly of instruction code for the processing engine.
Advantageously, relative addressing may be used for the
offset, which uses less code than absolute addressing.
[0012] In a preferred embodiment the processing
engine comprises an instruction pipeline having a
plurality of pipeline stages. At least one of these
instruction pipeline stages is an instruction fetch stage,
during which instruction code is fetched from a program
memory in four-byte chunks irrespective of instruction
boundaries for transfer into the instruction buffer.
Advantageously, the fetch stage may be inhibited when the
instruction buffer is full of instructions. Optionally,
the fetch stage may be inhibited subsequent to the
fetching of a final instruction for the block of
instructions from the program memory into the instruction
buffer. Thus, no further processor cycles, and
consequently no power in performing those cycles, are used
for fetching instructions from program memory once the
instruction buffer is full or once the final instruction
for the block of instructions has been fetched into the
instruction buffer.
[0013] Typically the processing engine includes
both a local read program counter and a local write
program counter, respectively pointing to the location in
the instruction buffer of the currently dispatched
instruction and the location for writing the next
instruction transferred from the program memory. The
difference between the local read program counter and
local write program counter may be monitored, and once
that difference equals a predetermined value, for example
the size of the instruction buffer, then the transfer of
instructions, that is to say the fetch stage, to said
instruction buffer may be inhibited. Optionally, the
inhibition of the transfer of instructions from the
program memory may be initiated for the difference between
the local read program counter and local write program
counter being less than the instruction buffer size. Such
a configuration would provide a margin at the boundaries
of the instruction buffer to account for any instruction
which may go over the boundaries of fetch packets or
dispatch packets of instructions.
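
By way of illustration only, the fetch-inhibit condition described above can be sketched in Python as follows. The function name, the `margin` parameter and the use of non-wrapping word counters are assumptions made for this sketch; the 32-word buffer size is taken from the embodiment described later in this specification.

```python
# Illustrative sketch only: names and the non-wrapping counters are assumptions.
IBQ_SIZE_WORDS = 32  # instruction buffer size of the described embodiment


def fetch_inhibited(lrpc: int, lwpc: int, margin: int = 0) -> bool:
    """Return True when transfer into the instruction buffer should stop.

    lrpc   -- local read program counter (word currently being dispatched)
    lwpc   -- local write program counter (next word to be written)
    margin -- optional number of words kept free at the buffer boundaries
    """
    occupied = lwpc - lrpc  # words fetched but not yet dispatched
    return occupied >= IBQ_SIZE_WORDS - margin


if __name__ == "__main__":
    print(fetch_inhibited(0, 30))            # buffer not yet full
    print(fetch_inhibited(0, 32))            # difference equals buffer size
    print(fetch_inhibited(0, 30, margin=2))  # inhibited early, leaving a margin
```

With `margin` set to zero the fetch stops only when the difference equals the buffer size; a non-zero margin models the optional variant in which inhibition starts earlier.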

[0014] Suitably, the processing engine is responsive
to a second instruction to store an iteration count
for repeat execution initiated by the first instruction.
In this manner the number of iterations of the buffer
instructions can be set up. Generally the iteration count
is decremented for each pass of the block of
instructions. However, the count step may not necessarily
be equal to one, but some other suitable integer value
dependent upon the circumstances.
[0015] The processing engine is generally operable
to restart the instruction fetch stage responsive to a last 
iteration of the body of instructions. Typically, the restart 
of the fetch stage is responsive to the dispatch of the ini- 
tial instruction for said last iteration. 
[0016] In a preferred aspect of the invention there is
provided a data processing system comprising a
processing engine as referred to above and a computer
comprising an assembler or compiler operable to align
the initial instruction and/or the final instruction with a
respective boundary of the instruction buffer in
accordance with an operator instruction. This
advantageously provides for optimisation of the block
size, since all of the instruction buffer may be used by
the block when such alignment provides for exact alignment
with the top and bottom boundaries of the instruction
buffer. Suitably, the assembler or compiler inserts
no-operation codes into the program code to cause said
alignment.
[0017] In accordance with the preferred embodiments
of the invention, fewer processing cycles are necessary
in order to perform repeat loops since no memory access
is required during the loop. This is still the case for
processing engines having an instruction cache before the
instruction buffer, since it is not necessary to access
the cache memory during the local loop. Since there is no
need to access memory outside of the processing engine,
there is no corresponding switching of large busses
(address/data busses) to access code within a large
memory (typically 2K), such access typically consuming
more power than a local fetch in the processing engine.
Consequently, there is a corresponding reduction in power
consumption by the processing engine. Therefore,
embodiments of the invention are particularly suitable
for use in portable apparatus, such as wireless
communication devices. Typically such a wireless
communication device comprises a user interface including
a display such as a liquid crystal display or a TFT
display, and a keypad or keyboard for inputting data to
the communications device. Additionally, a wireless
communication device will also comprise an antenna for
wireless communication with a radio telephone network or
the like.

BRIEF DESCRIPTION OF THE DRAWINGS

[0018] Particular embodiments in accordance with
the invention will now be described, by way of example
only, and with reference to the accompanying drawings
in which like reference signs are used to denote like
parts, unless otherwise stated, and in which:

Figure 1 is a schematic block diagram of a processor
in accordance with an embodiment of the invention;

Figure 2 is a schematic diagram of a core of the
processor of Figure 1;

Figure 3 is a more detailed schematic block diagram
of various execution units of the core of the
processor of Figure 1;

Figure 4 is a schematic diagram of an instruction
buffer queue and an instruction decoder controller
of the processor of Figure 1;

Figure 5 is a representation of pipeline phases of
the processor of Figure 1;

Figure 6 is a diagrammatic illustration of an example
of operation of a pipeline in the processor of
Figure 1;

Figure 7 is a schematic representation of the core
of the processor for explaining the operation of the
pipeline of the processor of Figure 1;

Figure 8 is a grid illustrating the status of an
instruction buffer queue during a local loop in
accordance with an embodiment of the invention; and

Figure 9 is a schematic illustration of a wireless
communication device suitable for incorporating
an embodiment of the invention.

DESCRIPTION OF PARTICULAR EMBODIMENTS

[0019] Although the invention finds particular
application to Digital Signal Processors (DSPs),
implemented for example in an Application Specific
Integrated Circuit (ASIC), it also finds application to
other forms of processing engines.
[0020] The basic architecture of an example of a 
processor according to the invention will now be 
described. 

[0021] Figure 1 is a schematic overview of a
processor 10 forming an exemplary embodiment of the
present invention. The processor 10 includes a
processing engine 100 and a processor backplane 20.
In the present embodiment, the processor is a Digital
Signal Processor 10 implemented in an Application
Specific Integrated Circuit (ASIC).
[0022] As shown in Figure 1, the processing engine
100 forms a central processing unit (CPU) with a
processing core 102 and a memory interface, or
management, unit 104 for interfacing the processing core
102 with memory units external to the processor core
102.

[0023] The processor backplane 20 comprises a
backplane bus 22, to which the memory management
unit 104 of the processing engine is connected. Also
connected to the backplane bus 22 are an instruction
cache memory 24, peripheral devices 26 and an external
interface 28.

[0024] It will be appreciated that in other
embodiments, the invention could be implemented using
different configurations and/or different technologies.
For example, the processing engine 100 could form the
processor 10, with the processor backplane 20 being
separate therefrom. The processing engine 100 could,
for example, be a DSP separate from and mounted on a
backplane 20 supporting a backplane bus 22, peripheral
and external interfaces. The processing engine 100
could, for example, be a microprocessor rather than a
DSP and could be implemented in technologies other
than ASIC technology. The processing engine, or a
processor including the processing engine, could be
implemented in one or more integrated circuits.
[0025] Figure 2 illustrates the basic structure of an
embodiment of the processing core 102. As illustrated,
the processing core 102 includes four elements, namely
an Instruction Buffer Unit (I Unit) 106 and three
execution units. The execution units are a Program Flow
Unit (P Unit) 108, Address Data Flow Unit (A Unit) 110 and
a Data Computation Unit (D Unit) 112 for executing
instructions decoded from the Instruction Buffer Unit (I
Unit) 106 and for controlling and monitoring program
flow.

[0026] Figure 3 illustrates the P Unit 108, A Unit 110
and D Unit 112 of the processing core 102 in more detail
and shows the bus structure connecting the various
elements of the processing core 102. The P Unit 108
includes, for example, loop control circuitry,
GoTo/Branch control circuitry and various registers for
controlling and monitoring program flow such as repeat
counter registers and interrupt mask, flag or vector
registers. The P Unit 108 is coupled to general purpose
Data Write busses (EB, FB) 130, 132, Data Read
busses (CB, DB) 134, 136 and a coefficient program
bus (BB) 138. Additionally, the P Unit 108 is coupled to
sub-units within the A Unit 110 and D Unit 112 via
various busses labeled CSR, ACB and RGD.
[0027] As illustrated in Figure 3, in the present
embodiment the A Unit 110 includes a register file 30, a
data address generation sub-unit (DAGEN) 32 and an
Arithmetic and Logic Unit (ALU) 34. The A Unit register
file 30 includes various registers, among which are 16
bit pointer registers (AR0, ..., AR7) and data registers
(DR0, ..., DR3) which may also be used for data flow as
well as address generation. Additionally, the register file
includes 16 bit circular buffer registers and 7 bit data
page registers. As well as the general purpose busses
(EB, FB, CB, DB) 130, 132, 134, 136, a coefficient data
bus 140 and a coefficient address bus 142 are coupled
to the A Unit register file 30. The A Unit register file 30 is
coupled to the A Unit DAGEN unit 32 by unidirectional
busses 144 and 146 respectively operating in opposite
directions. The DAGEN unit 32 includes 16 bit X/Y
registers and coefficient and stack pointer registers, for
example for controlling and monitoring address
generation within the processing engine 100.
[0028] The A Unit 110 also comprises the ALU 34 
which includes a shifter function as well as the functions 
typically associated with an ALU such as addition, sub- 
traction, and AND, OR and XOR logical operators. The 
ALU 34 is also coupled to the general-purpose busses 
(EB, DB) 130, 136 and an instruction constant data bus 
(KDB) 140. The A Unit ALU is coupled to the P Unit 108 
by a PDA bus for receiving register content from the P 
Unit 108 register file. The ALU 34 is also coupled to the 
A Unit register file 30 by busses RGA and RGB for 
receiving address and data register contents and by a 
bus RGD for forwarding address and data registers in 
the register file 30. 

[0029] As illustrated, the D Unit 112 includes a D
Unit register file 36, a D Unit ALU 38, a D Unit shifter 40
and two multiply and accumulate units (MAC1, MAC2)
42 and 44. The D Unit register file 36, D Unit ALU 38
and D Unit shifter 40 are coupled to busses (EB, FB,
CB, DB and KDB) 130, 132, 134, 136 and 140, and the
MAC units 42 and 44 are coupled to the busses (CB,
DB, KDB) 134, 136, 140 and data read bus (BB) 144.
The D Unit register file 36 includes 40-bit accumulators
(AC0, ..., AC3) and a 16-bit transition register. The D
Unit 112 can also utilize the 16 bit pointer and data
registers in the A Unit 110 as source or destination
registers in addition to the 40-bit accumulators. The D
Unit register file 36 receives data from the D Unit ALU 38
and MACs 1&2 42, 44 over accumulator write busses
(ACW0, ACW1) 146, 148, and from the D Unit shifter 40
over accumulator write bus (ACW1) 148. Data is read
from the D Unit register file accumulators to the D Unit
ALU 38, D Unit shifter 40 and MACs 1&2 42, 44 over
accumulator read busses (ACR0, ACR1) 150, 152. The
D Unit ALU 38 and D Unit shifter 40 are also coupled to
sub-units of the A Unit 110 via various busses labeled
EFC, DRB, DR2 and ACB.

[0030] Referring now to Figure 4, there is illustrated
an instruction buffer unit 106 comprising a 32 word
instruction buffer queue (IBQ) 502. The IBQ 502
comprises 32x16 bit registers 504, logically divided into
8 bit bytes 506. Instructions arrive at the IBQ 502 via
the 32-bit program bus (PB) 122. The instructions are
fetched in a 32-bit cycle into the location pointed to by
the Local Write Program Counter (LWPC) 532. The LWPC 532
is contained in a register located in the P Unit 108. The P
Unit 108 also includes the Local Read Program Counter
(LRPC) 536 register, and the Write Program Counter
(WPC) 530 and Read Program Counter (RPC) 534
registers. LRPC 536 points to the location in the IBQ 502
of the next instruction or instructions to be loaded into
the instruction decoder(s) 512 and 514. That is to say, the
LRPC 536 points to the location in the IBQ 502 of the
instruction currently being dispatched to the decoders
512, 514. The WPC points to the address in program
memory of the start of the next 4 bytes of instruction
code for the pipeline. For each fetch into the IBQ, the
next 4 bytes from the program memory are fetched
regardless of instruction boundaries. The RPC 534
points to the address in program memory of the
instruction currently being dispatched to the decoder(s)
512 and 514.

[0031] The instructions are formed into a 48-bit 
word and are loaded into the instruction decoders 512, 
514 over a 48-bit bus 516 via multiplexors 520 and 521.
It will be apparent to a person of ordinary skill in the art
that the instructions may be formed into words compris- 
ing other than 48-bits, and that the present invention is 
not limited to the specific embodiment described above. 
[0032] The bus 516 can load a maximum of two 
instructions, one per decoder, during any one instruc- 
tion cycle. The combination of instructions may be in 
any combination of formats, 8, 16, 24, 32, 40 and 48
bits, which will fit across the 48-bit bus. Decoder 1, 512,
is loaded in preference to decoder 2, 514, if only one
instruction can be loaded during a cycle. The respective
instructions are then forwarded on to the respective
function units in order to execute them and to access
the data for which the instruction or operation is to be
performed. Prior to being passed to the instruction
decoders, the instructions are aligned on byte
boundaries. The alignment is done based on the format
derived for the previous instruction during decoding
thereof. The multiplexing associated with the alignment of
instructions with byte boundaries is performed in
multiplexors 520 and 521.
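
As a minimal sketch of the dispatch constraint just described, the following Python fragment checks whether two instruction formats fit together across the 48-bit bus 516. The function and constant names are hypothetical and chosen for this illustration only; any further pairing rules the hardware may apply are not modelled.

```python
# Illustrative sketch only: function and constant names are assumptions.
BUS_WIDTH_BITS = 48                       # width of bus 516
VALID_FORMATS = (8, 16, 24, 32, 40, 48)   # instruction formats listed above


def can_dispatch_pair(first_bits: int, second_bits: int) -> bool:
    """Return True if both instructions fit across the bus in one cycle."""
    for width in (first_bits, second_bits):
        if width not in VALID_FORMATS:
            raise ValueError(f"unsupported instruction format: {width} bits")
    return first_bits + second_bits <= BUS_WIDTH_BITS


if __name__ == "__main__":
    print(can_dispatch_pair(16, 32))  # fits exactly across 48 bits
    print(can_dispatch_pair(24, 32))  # too wide: only decoder 1 is loaded
```

When the pair does not fit, only one instruction is loaded in that cycle, and decoder 1, 512, takes it.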

[0033] The processor core 102 executes instructions
through a 7 stage pipeline, the respective stages
of which will now be described with reference to Figure 5.

[0034] The first stage of the pipeline is a PRE-FETCH
(P0) stage 202, during which stage a next program
memory location is addressed by asserting an
address on the address bus (PAB) 118 of a memory
interface, or memory management unit 104.

[0035] In the next stage, FETCH (P1) stage 204,
the program memory is read and the I Unit 106 is filled 
via the PB bus 122 from the memory management unit 
104. 

[0036] The PRE-FETCH and FETCH stages are
separate from the rest of the pipeline stages in that the
pipeline can be interrupted during the PRE-FETCH and
FETCH stages to break the sequential program flow
and point to other instructions in the program memory,
for example for a Branch instruction.

[0037] The next instruction in the instruction buffer
is then dispatched to the decoder/s 512/514 in the third
stage, DECODE (P2) 206, where the instruction is
decoded and dispatched to the execution unit for
executing that instruction, for example to the P Unit 108,
the A Unit 110 or the D Unit 112. The decode stage 206
includes decoding at least part of an instruction
including a first part indicating the class of the
instruction, a second part indicating the format of the
instruction and a third part indicating an addressing mode
for the instruction.

[0038] The next stage is an ADDRESS (P3) stage
208, in which the address of the data to be used in the
instruction is computed, or a new program address is
computed should the instruction require a program
branch or jump. Respective computations take place in
the A Unit 110 or the P Unit 108 respectively.
[0039] In an ACCESS (P4) stage 210 the address
of a read operand is output and the memory operand,
the address of which has been generated in a DAGEN
X operator with an Xmem indirect addressing mode, is
then READ from indirectly addressed X memory
(Xmem).

[0040] The next stage of the pipeline is the READ
(P5) stage 212 in which a memory operand, the
address of which has been generated in a DAGEN Y
operator with a Ymem indirect addressing mode or in
a DAGEN C operator with coefficient address mode, is
READ. The address of the memory location to which the
result of the instruction is to be written is output.

[0041] In the case of dual access, read operands 
can also be generated in the Y path, and write operands 
in the X path. 

[0042] Finally, there is an execution EXEC (P6)
stage 214 in which the instruction is executed in either
the A Unit 110 or the D Unit 112. The result is then
stored in a data register or accumulator, or written to
memory for Read/Modify/Write or store instructions.
Additionally, shift operations are performed on data in
accumulators during the EXEC stage.
[0043] The basic principle of operation for a pipeline
processor will now be described with reference to
Figure 6. As can be seen from Figure 6, for a first
instruction 302, the successive pipeline stages take place
over time periods T1-T7. Each time period is a clock cycle
for the processor machine clock. A second instruction 304
can enter the pipeline in period T2, since the previous
instruction has now moved on to the next pipeline stage.
For instruction 3, 306, the PRE-FETCH stage 202
occurs in time period T3. As can be seen from Figure 6,
for a seven stage pipeline a total of 7 instructions may
be processed simultaneously. Figure 6 shows all 7
instructions 302-314 under process in time
period T7. Such a structure adds a form of parallelism to
the processing of instructions.
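
The overlap of instructions in the seven stage pipeline can be illustrated with the short Python sketch below. It assumes, purely for illustration, that one instruction enters the pipeline in each time period and that no stalls occur; the function names are hypothetical.

```python
# Illustrative sketch only: assumes one instruction enters per period, no stalls.
STAGES = ("PRE-FETCH", "FETCH", "DECODE", "ADDRESS", "ACCESS", "READ", "EXEC")


def stage_in_period(instruction: int, period: int):
    """Return the stage occupied by the n-th instruction (1-based) in period
    T<period> (1-based), or None if the instruction is not in the pipeline."""
    index = period - instruction
    if 0 <= index < len(STAGES):
        return STAGES[index]
    return None


def in_flight(period: int, total_instructions: int):
    """List the instructions that occupy some pipeline stage in a period."""
    return [n for n in range(1, total_instructions + 1)
            if stage_in_period(n, period) is not None]


if __name__ == "__main__":
    for n in in_flight(7, 7):
        print(f"instruction {n}: {stage_in_period(n, 7)}")
```

Under these assumptions, period T7 is the first period in which all seven stages are occupied, the first instruction being in EXEC while the seventh is in PRE-FETCH.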

[0044] As shown in Figure 7, the present embodiment
of the invention includes a memory management
unit 104 which is coupled to external memory units via a
24 bit address bus 114 and a bi-directional 16 bit data 
bus 116. Additionally, the memory management unit 
104 is coupled to program storage memory (not shown) 
via a 24 bit address bus 118 and a 32 bit bi-directional 
data bus 120. The memory management unit 104 is 
also coupled to the I Unit 106 of the machine processor 
core 102 via a 32 bit program read bus (PB) 122. The P 
Unit 108, A Unit 110 and D Unit 112 are coupled to the 
memory management unit 104 via data read and data 
write busses and corresponding address busses. The P 
Unit 108 is further coupled to a program address bus 
128. 

[0045] More particularly the P Unit 108 is coupled
to the memory management unit 104 by a 24 bit
program address bus 128, the two 16 bit data write busses
(EB, FB) 130, 132, and the two 16 bit data read busses
(CB, DB) 134, 136. The A Unit 110 is coupled to the
memory management unit 104 via two 24 bit data write
address busses (EAB, FAB) 160, 162, the two 16 bit
data write busses (EB, FB) 130, 132, the three data
read address busses (BAB, CAB, DAB) 164, 166, 168
and the two 16 bit data read busses (CB, DB) 134, 136.
The D Unit 112 is coupled to the memory management
unit 104 via the two data write busses (EB, FB) 130, 132
and three data read busses (BB, CB, DB) 144, 134, 136.
[0046] Figure 7 represents the passing of instruc- 
tions from the I Unit 106 to the P Unit 108 at 124, for for- 
warding branch instructions for example. Additionally, 
Figure 7 represents the passing of data from the I Unit 
106 to the A Unit 110 and the D Unit 112 at 126 and 128
respectively.

[0047] In accordance with a preferred embodiment 
of the invention, the processing engine is configured to 
respond to a local repeat instruction which provides for 
an iterative looping through a set of instructions all of 
which are contained in the Instruction Buffer Queue
502. The local repeat instruction is a 16 bit instruction
and comprises: an op-code; parallel enable bit; and an
offset (6 bits).

[0048] The op-code defines the instruction as a
local instruction, and prompts the processing engine to
expect the offset and op-code extension. In the
described embodiment the offset has a maximum value
of 56, which defines the greatest size of the local loop
as 56 bytes of instruction code.
[0049] Referring now to Figure 4, the IBQ 502 is 64
bytes long and is organised into 32x16 bit words.
Instructions are fetched into IBQ 502 2 words at a time.
Additionally, the Instruction Decoder Controller reads a
packet of up to 6 program code bytes into the instruction
decoders 512 and 514 for each Decode stage of the
pipeline. The start and end of the loop, i.e. first and last
instructions, may fall at any of the byte boundaries
within the 4 byte packet of program code fetched to the
IBQ 502. Thus, the start (first) and end (last) instructions
are not necessarily co-terminous with the top and
bottom of IBQ 502. For example, in a case where the local
loop instruction spans two bytes across the boundary of
a packet of 4 program codes, both packets of 4 program
codes must be retained in the IBQ 502 for execution of
the local loop repeat. In order to take this into
account the local loop instruction offset is a maximum of
56 bytes.
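
The buffer-size arithmetic behind these figures can be checked with the following Python sketch. The constant and function names are hypothetical. The observation that the 8 bytes left over correspond to two 4 byte fetch packets is offered as an illustration of the margin discussed above, not as a statement of how the limit of 56 was derived.

```python
# Illustrative sketch only: names are assumptions; figures are from the text.
IBQ_BYTES = 32 * 2          # 32 words of 16 bits each
FETCH_PACKET_BYTES = 4      # program code is fetched 2 words at a time
MAX_LOOP_OFFSET = 56        # maximum offset of the local loop instruction
OFFSET_FIELD_BITS = 6       # width of the offset field


def offset_is_legal(offset: int) -> bool:
    """Check an offset as an assembler or compiler might at build time."""
    return 0 <= offset <= MAX_LOOP_OFFSET


if __name__ == "__main__":
    margin = IBQ_BYTES - MAX_LOOP_OFFSET
    print(margin, margin // FETCH_PACKET_BYTES)   # spare bytes, spare packets
    print(MAX_LOOP_OFFSET < 2 ** OFFSET_FIELD_BITS)  # 56 fits in 6 bits
    print(offset_is_legal(56), offset_is_legal(57))
```

A check of this kind corresponds to the verification on the offset that may be carried out during compilation or assembly.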

[0050] When the local loop instruction is decoded
the start address for the local loop, i.e. the address after
the local loop instruction address, is stored in the Block
Repeat Start Address0 (RSA0) register which is located,
for example, in the P Unit 108. After the initial pass
through the loop, the Read Program Counter (RPC) is
loaded with the contents of RSA0 for re-entering the
loop. The location of the end of the local loop is
computed using the offset, and the location is stored in
the Block Repeat End Address0 (REA0) register which may
also be located in the P Unit 108, for example. Two
repeat start address registers and two repeat end
address registers (RSA0, RSA1, REA0, REA1) are
provided for nested loops. For nesting levels greater than
two, preceding start/end addresses are pushed to a
stack register.
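
A minimal Python sketch of the start and end address set-up follows. The start address rule (the address immediately after the 16 bit local loop instruction) is taken from the text. The end address rule used here, start address plus offset, is only an assumption for illustration, since the text states that the end is computed using the offset without giving the formula. The function name is hypothetical.

```python
# Illustrative sketch only: the end-address formula is an assumption.
RPTL_SIZE_BYTES = 2  # the local repeat instruction is a 16 bit instruction


def loop_bounds(rptl_address: int, offset: int):
    """Return (RSA0, REA0) byte addresses for a local loop.

    rptl_address -- program address of the local loop instruction
    offset       -- offset field of the instruction, in bytes (assumed to be
                    measured from the loop start address)
    """
    start = rptl_address + RPTL_SIZE_BYTES  # address after the loop instruction
    end = start + offset                    # assumed interpretation of offset
    return start, end


if __name__ == "__main__":
    rsa0, rea0 = loop_bounds(0x100, 10)
    print(hex(rsa0), hex(rea0))
```

After each pass other than the last, the read program counter would be reloaded with the first value of the pair to re-enter the loop.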

[0051] During the first iteration of a local loop, the
program code for the body of the loop is loaded into the
IBQ 502 and executed as usual. However, for the
following iterations no fetch will occur until the last
iteration, during which the fetch will restart.
[0052] Referring now to Figure 8, the local loop 
instruction flow for the preferred embodiment will be 
described. The local loop repeat is set up by initialising 

ro a Block Repeat Count (BRC0/BRC1), shown in the 
DECODE stage in a first pipeline slot 602, with the 
number of iterations of the local loop, and then in the 
next slot 604 the local loop instruction (RPTL) itself is * 
decoded. The BRC0/BRC1 is decremented for each 

5 repeat of the instructions. It will be evident to a skilled 
person that optionally the local loop repeat maybe set 
up by defining a maximum iteration value, and initialis- 
ing a counter to zero. The counter can then be incre- 
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mented for each repeat of the instructions. : The 
decrement or increment may be in steps other than one. 
During slots 602 and 604, the Program Counter
increases by 4 bytes to PC, and 2 further instruction
words are fetched into the IBQ 502; thus 2 instruction
words per slot 602, 604 are fetched into the IBQ 502. In slot
602 the number of words 504 available in the IBQ 502 is
2, and is shown labelled Count in Figure 8. The number
of words available in the IBQ 502 is given by the difference
between the LRPC 536 and the LWPC 532, since
they respectively point to the currently dispatched
instruction and the location for writing the next instruction
into the IBQ 502. Since, for the purposes of this
embodiment, the instruction which initialises the
BRC0/BRC1 is a one word 16 bit instruction, for example
BRC0/BRC1 = DAx, and comprises no parallelism,
only the 16 bit initialisation instruction is dispatched to
the first or second instruction decoder 512, 514 in slot
602.
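
The word count described above can be sketched as follows. This is a minimal illustration only, not the circuit of the embodiment: the 32-word buffer size is the maximum IBQ size stated later in the description, while the function name `words_available` and the modelling of the two pointers as free-running word counters are assumptions made for the example.

```python
IBQ_SIZE_WORDS = 32  # maximum IBQ size in 16-bit words, as stated in the description


def words_available(lwpc: int, lrpc: int) -> int:
    """Return the number of instruction words waiting in the buffer.

    lwpc and lrpc are modelled as free-running word counters (an assumption);
    the physical buffer position would then be pointer % IBQ_SIZE_WORDS.
    """
    count = lwpc - lrpc
    if not 0 <= count <= IBQ_SIZE_WORDS:
        raise ValueError("write pointer must lead the read pointer by 0..32 words")
    return count
```

For instance, after two fetches of 2 words and the dispatch of one 1-word instruction, `words_available(4, 1)` gives 3, which matches the Count described for slot 604.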

[0053] For the next slot 604, the WPC increases by
4 to PC and a further 2 x 16 bit instruction words 504 are
fetched into the IBQ 502. The number of instruction words
504 available in the IBQ 502 is now 3, since only the 1
word instruction initialising BRC0/BRC1 was dispatched
during the previous slot 602.

[0054] The first iteration of the local loop begins at
slot 606, where a first parallel pair of instructions L0, L1
are dispatched to the decoders 512, 514. The number
of instruction words 504 which are available in the IBQ
502 is now 4. This is because in the present embodiment
the local loop instruction is only a 16 bit instruction
and therefore only one word 504 was dispatched to the
decoder 512 during the previous slot 604.
[0055] In order to optimise the execution of the local
loop, the instructions are executed in parallel so far as is
possible. In the present example, it is assumed that all
instructions comprising the body of the loop are executable
in parallel. This results in two unused slots 610,
612 during the first pass of the body of the loop, but
leads to greater speed for the rest of the iterations.
[0056] Additionally, for the present example instructions
L0, L1 are executable in parallel and comprise a
total of 48 bits, thus 3 instruction words 504 are dispatched
to the decoders 512, 514 for each decode
stage. For the start of the repeat block, cycle 606, two
instructions L0 and L1 are dispatched to the decoders
and the difference between the LRPC 536 and the
LWPC 532 is 4. In cycle 608 a further two instruction
words are fetched into the IBQ, but three words are dispatched.

[0057] The LRPC 536 now moves 3 words along
the IBQ 502, and the LWPC 532 moves 2 words along
the IBQ 502 to the next fetch location. Thus, the difference
between LWPC 532 and LRPC 536 is decreased
by one to three for the next slot 608. Again, assuming the
next two instructions L2, L3 are executable in parallel
and comprise a total of 48 bits, the LRPC 536 moves 3
words along the IBQ 502 ready for the next slot 610.
The program pre-fetch is halted for one slot, in this case
slot 608, and therefore no instruction words are loaded
into the IBQ 502 for this slot. Thus, for slot 610 the
LRPC 536 and LWPC 532 point to the same IBQ 502
address, and Count = 0. Since there are no available
bits for dispatch in the IBQ 502, slot 610 is an unused
slot for decoding. However, 2 instruction words are
fetched into the IBQ 502 during slot 610, moving LWPC
532 along the IBQ by two words, and therefore there are 2
instruction words available for slot 612. However, if the
next two instructions, L4, L5, are parallel instructions
comprising 48 bits then there is no dispatch in slot 612,
and there is a further unused slot.
[0058] For slot 614 there are a total of 4 instruction
words 504 available in the IBQ 502, and instructions L4,
L5, comprising 48 bits, are dispatched to decoders 512,
514. A further two instruction words 504 are fetched into
the IBQ 502 during slot 614. The WPC has now
increased by 16 packets of 2 instruction words 504,
and thus the IBQ 502 is full and all the loop body has
been fetched. Thus, as can be seen, the WPC count for
slot 616 remains at PC + 16 for the Pre-Fetch, although
a further 2 words 504 are fetched into the IBQ 502
originating from the pre-fetch of slot 614.
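
The slot-by-slot Count values described for slots 602 to 614 can be reproduced with a small model. The following is a sketch under stated assumptions rather than a description of the hardware: it assumes that words fetched during one slot become available in the next slot, that an instruction (or parallel pair) is dispatched only when all of its words are already in the buffer, and that the Count at slot 602 starts at 2. The names `simulate_counts`, `WIDTHS` and `FETCHED` are hypothetical.

```python
from typing import List, Tuple


def simulate_counts(
    widths: List[int], fetched: List[int], initial_count: int = 2
) -> List[Tuple[int, int]]:
    """Return (count, dispatched) for each slot.

    widths  -- word widths of the dispatch units, in program order
    fetched -- words fetched during each slot (visible from the next slot)
    """
    results = []
    count = initial_count
    next_unit = 0  # index of the next dispatch unit waiting in the buffer
    for words_in in fetched:
        need = widths[next_unit] if next_unit < len(widths) else 0
        # Dispatch only if the whole unit is already in the buffer.
        dispatched = need if 0 < need <= count else 0
        if dispatched:
            next_unit += 1
        results.append((count, dispatched))
        # Words fetched in this slot are available from the next slot onwards.
        count = count - dispatched + words_in
    return results


# BRC initialisation (1 word), RPTL (1 word), then three 48-bit (3-word)
# parallel pairs L0/L1, L2/L3 and L4/L5.
WIDTHS = [1, 1, 3, 3, 3]
# Two words are fetched per slot, except slot 608 where pre-fetch is halted.
FETCHED = [2, 2, 2, 0, 2, 2, 2]

for slot, (count, dispatched) in zip(range(602, 616, 2), simulate_counts(WIDTHS, FETCHED)):
    print(f"slot {slot}: Count = {count}, dispatched = {dispatched}")
```

Under these assumptions the model gives Count values of 2, 3, 4, 3, 0, 2 and 4 for slots 602 to 614, with no dispatch in slots 610 and 612, in agreement with the two unused slots described above.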
[0059] For slot 616 the body of the loop has been
fetched into the IBQ 502, and there are 32 words available
in the IBQ. This is the maximum size of the IBQ
502, and hence the fetch is switched off for further slots
618, 620 onwards forming further iterations of the loop.
[0060] For the last iteration of the loop, the fetch is
switched back on in slot 626 in order to top up the IBQ
502 to avoid any gaps in the queue.
[0061] Thus, for the body of the loop, excluding the
first and last iteration, there is no pipeline fetch phase.
Thus, there is no program memory access. This
reduces power consumption during the loop compared
to conventional loops, since fewer program memory
accesses are performed.
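
The reduction in program memory accesses can be illustrated with simple arithmetic. The sketch below is illustrative only: it assumes a conventional loop re-fetches the whole body on every iteration, counts only fetches of loop-body code (the fetch restarted on the last iteration brings in code following the loop and is not counted), and uses hypothetical function names.

```python
import math

WORDS_PER_FETCH = 2  # two 16-bit instruction words per fetch, as in Figure 8


def fetch_enabled(iteration: int, iterations: int) -> bool:
    """True for the first and last iteration, False in between (numbered from 1)."""
    return iteration == 1 or iteration == iterations


def body_fetches_conventional(body_words: int, iterations: int) -> int:
    """Body fetches if the body is re-fetched on every iteration (assumed baseline)."""
    return iterations * math.ceil(body_words / WORDS_PER_FETCH)


def body_fetches_local_loop(body_words: int, iterations: int) -> int:
    """Body fetches when the body is fetched once and then replayed from the IBQ."""
    if iterations < 1:
        return 0
    return math.ceil(body_words / WORDS_PER_FETCH)
```

For a 32-word body repeated 10 times, the assumed conventional baseline performs 160 body fetches, whereas the local loop performs 16.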

[0062] In a further preferred embodiment the
processing engine is configured to align instruction
words in the IBQ 502 in order to maximise the block size
for a local loop. The alignment of the instruction words
may operate to place start and end instructions for a
local loop as close to respective boundaries of the IBQ
502 as possible.

[0063] The processing engine may include an
Assembler which configures the alignment of instructions
in the IBQ 502 to maximise the block size for a
local loop.
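
One way such an assembler-side alignment could work is sketched below. This is a hypothetical illustration, not the Assembler of the embodiment: it assumes that a word's position in the IBQ is its program word address modulo the 32-word buffer size, that a no-operation instruction occupies one word, and that padding is placed before the loop. The function names are invented for the example.

```python
IBQ_SIZE_WORDS = 32  # buffer size in 16-bit words, as described for the IBQ 502


def nops_to_align(start_word_addr: int) -> int:
    """Number of one-word NOPs needed so the loop starts on a buffer boundary.

    Assumes buffer position = word address % IBQ_SIZE_WORDS.
    """
    return (-start_word_addr) % IBQ_SIZE_WORDS


def check_local_loop(start_word_addr: int, end_word_addr: int) -> int:
    """Return the NOP padding for a loop body, or raise if the body cannot fit.

    end_word_addr is the address of the last word of the final instruction.
    """
    body_words = end_word_addr - start_word_addr + 1
    if body_words > IBQ_SIZE_WORDS:
        raise ValueError("loop body exceeds the instruction buffer size")
    return nops_to_align(start_word_addr)
```

A loop starting at word address 30 would need 2 words of padding under these assumptions, while a body of more than 32 words is rejected because it could not reside wholly in the buffer.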

[0064] Embodiments of the invention are suitable
for wireless communication devices such as a radio telephone
50 illustrated in Fig. 9. Such a radio telephone
50 comprises a user interface 52 including a display 54
such as a liquid crystal display, and a key pad 56. The
radio telephone also includes a transceiver, not shown,
and an antenna 58.

[0065] In view of the foregoing description it will be
evident to a person skilled in the art that various modifications
may be made within the scope of the invention.
For example, the instructions comprising the body of the
loop need not be full 48 bit parallel instructions, or even
parallel instructions at all. Additionally, the loop need not
take up all of the IBQ, but may be smaller than that
described above. Furthermore, the program memory
may comprise a memory cache.

[0066] The scope of the present disclosure includes
any novel feature or combination of features disclosed
therein either explicitly or implicitly or any generalisation
thereof irrespective of whether or not it relates to the
claimed invention or mitigates any or all of the problems
addressed by the present invention. The applicant
hereby gives notice that new claims may be formulated
to such features during the prosecution of this application
or of any such further application derived therefrom.
In particular, with reference to the appended claims,
features from dependent claims may be combined with
those of the independent claims in any appropriate
manner and not merely in the specific combinations
enumerated in the claims.

Claims 

1. A processing engine, comprising an execution unit
for executing machine-readable instructions dispatched
thereto, and an instruction buffer for temporarily
storing a plurality of machine-readable
instructions transferred thereto prior to dispatch to
the execution unit, the execution unit responsive to
a first machine-readable instruction to initiate iterative
execution of a block of machine-readable
instructions residing in said instruction buffer, said
block including an initial instruction and a final
instruction.

2. A processing engine according to claim 1, wherein
said first instruction comprises an offset indicative
of a position in said buffer of said final instruction
relative to a position in said buffer of said initial
instruction.

3. A processing engine according to claim 1 or claim
2, further comprising an instruction pipeline having
a plurality of pipeline stages, said instruction pipeline
providing an instruction fetch stage for fetching
instruction code from a program memory for transfer
into the instruction buffer.

4. A processing engine according to claim 3, operable
to inhibit said instruction fetch stage subsequent to
fetching said final instruction from the program
memory into the instruction buffer.



5. A processing engine according to claim 3 or claim
4, operable to inhibit said instruction fetch stage for
said instruction buffer being full.

6. A processing engine according to any preceding
claim, further comprising a local write program
counter for pointing to a location in said instruction
buffer for writing instruction code from said program
memory, and a local read program counter for
pointing to a location in said instruction buffer corresponding
to a currently dispatched instruction, and
wherein the processing engine is responsive to a
predetermined separation of said local write program
counter and said local read program counter
to inhibit transfer of instruction code to said
instruction buffer.

7. A processing engine according to claim 6, wherein
said predetermined separation is equal to the size
of said instruction buffer.

8. A processing engine according to any preceding
claim responsive to a second instruction to store an
iteration count for repeat execution initiated by said
initial instruction.

9. A processing engine according to claim 8, operable
to decrement said iteration count for a pass of said
block of instructions.



10. A processing engine according to claim 3, or any of
claims 4 to 9 dependent on claim 3, operable to
restart said instruction code fetch stage responsive
to a last iteration of said block of instructions.

11. A processing engine according to claim 10, operable
to re-start said instruction code fetch stage
responsive to dispatch of said initial instruction for
said final iteration.

12. A system comprising a processing engine according
to any preceding claim and a computer including
a compiler or assembler, said compiler or
assembler responsive to an operator directive to
align an instruction with a boundary of said instruction
buffer.

13. A system according to claim 12, wherein said 
instruction is said initial instruction. 

14. A system according to claim 12 or claim 13, 
wherein said instruction is said final instruction. 

15. A system according to any of claims 12 to 14, said 
compiler or assembler responsive to said operator 
directive to insert a no operation code in said block 
of code for aligning said initial and/or final instruc- 
tion with said boundary. 

16. A system according to any of claims 12 to 15,
wherein said compiler or assembler checks said
offset is not greater than a threshold value.

17. A data processing system according to claim 16,
wherein said threshold value is dependent on the
instruction buffer size.

18. A processor comprising a processing engine
according to any of claims 1 to 11.

19. An integrated circuit comprising a processing
engine according to any of claims 1 to 11.

20. A digital signal processor comprising a processing
engine according to any one of claims 1 to 11.

21. An integrated circuit comprising a digital signal
processor according to claim 20.

22. A method for operating a processing engine, com- 
prising temporarily storing a plurality of instructions 
in an instruction buffer, and repeatedly executing a 
block of instructions of said plurality of instructions. 

23. A method according to claim 22, further comprising 
fetching instruction code from program memory for 
transfer into said instruction buffer for forming said 
plurality of instructions. 

24. A method according to claim 23, further comprising 
inhibiting fetching instruction code subsequent to 
fetching a final instruction to said block of instruc- 
tions. 



25. A method according to claim 23 or 24, further comprising
inhibiting fetching instruction code for transfer
to said instruction buffer for said instruction
buffer being full.

26. A method according to any of claims 23 to 25, further
comprising monitoring a first location in said
instruction buffer for an instruction being executed
and a second location in said instruction buffer for
writing an instruction transferred from program
memory, and inhibiting said fetching for a difference
between said first and second locations equalling a
predetermined value.

27. A method according to claim 26, wherein said predetermined
value is the instruction buffer size.

28. A method according to any of claims 22 to 28, further
comprising storing an iteration count for said
repeat execution of said block.

29. A method according to claim 28, further comprising
decrementing said iteration count for a pass of said
block of instructions.

30. A method according to claim 23 or any of claims 24
to 29 dependent on claim 23, further comprising re-starting
said fetching for a last repeat of said body
of instructions.

31. A method according to any of claims 22 to 30, further
comprising aligning an initial and/or final
instruction for said block of instructions with a
respective boundary of said instruction buffer.

32. A method according to any of claims 22 to 30, further
comprising compiling or assembling said
instructions forming said block of instructions for
aligning an initial and/or final instruction for said
block of instructions with a respective boundary of
said instruction buffer.

33. A telecommunications device comprising a
processing engine according to any of claims 1 to

34. A telecommunications device operable in accordance
with a method of any of claims 22 to 32.

35. A wireless communications device comprising a
telecommunications device according to claim 33 or
34, a user interface including a display, a keypad or
keyboard for inputting data to the communications
device, a transceiver and an antenna.